{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 利用Python进行数据分析 chapter5 pandas基础\n",
    "\n",
    "> 2023/2/1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 201,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基本数据结构"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Series\n",
    "   \n",
    "   类似一维数组，由数据和索引组成"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "obj <class 'pandas.core.series.Series'> 0    0\n",
      "1    1\n",
      "2    2\n",
      "dtype: int64\n",
      "obj.value <class 'numpy.ndarray'> [0 1 2]\n",
      "obj.index <class 'pandas.core.indexes.range.RangeIndex'> RangeIndex(start=0, stop=3, step=1)\n"
     ]
    }
   ],
   "source": [
    "obj = pd.Series([i for i in range(3)])\n",
    "print('obj',type(obj),obj)\n",
    "print('obj.value',type(obj.values),obj.values)\n",
    "print('obj.index',type(obj.index),obj.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "a    0\nb    1\nc    2\ndtype: int64"
     },
     "execution_count": 203,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 创建Series时候可以指定索引\n",
    "# 索引长度需要一直\n",
    "obj = pd.Series([i for i in range(3)],index=['a','b','c'])\n",
    "obj"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "Series可以通过索引选择其中的一个或者一组值\n",
    "\n",
    " 一组值（列毕哦啊）的话返回的还是Series\n",
    "\n",
    " 对Series同样可以使用切片进行索引\n",
    "\n",
    " 列表索引和切片查询若长度不唯一那么还是返回的Series"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "索引 0\n",
      "列表索引 a    0\n",
      "c    2\n",
      "dtype: int64\n",
      "切片 a    0\n",
      "b    1\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print('索引',obj['a'])\n",
    "print('列表索引',obj[['a','c']])\n",
    "print('切片',obj[:2])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Numpy.ndarray的数组运算同样可以用到Series上，而且Series的 **索引/值** 间的关系依旧存在\n",
    "\n",
    "可以将Series看成是一个定长的有序字典 \n",
    "\n",
    "同样可以通过字典来创建Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "True"
     },
     "execution_count": 205,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'b' in obj"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 206,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "a    4.0\nb    3.0\ndtype: float64"
     },
     "execution_count": 206,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 使用字典创建Series\n",
    "sdata = {'a':4,'b':3}\n",
    "obj1 = pd.Series(sdata,dtype=float)\n",
    "obj1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 207,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "a    4.0\nc    NaN\nb    3.0\ndtype: float64"
     },
     "execution_count": 207,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 对于索引对不上的会填充NaN\n",
    "sdata2 = ['a','c','b']\n",
    "obj2 = pd.Series(sdata,index=sdata2)\n",
    "obj2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 208,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "a    8.0\nb    6.0\nc    NaN\ndtype: float64"
     },
     "execution_count": 208,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Series的算术运算中会自动根据索引来进行对其\n",
    "obj1 + obj2"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Series对象有一个name属性，同时可以通过赋值的方法直接修改Series的索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 209,
   "outputs": [
    {
     "data": {
      "text/plain": "c    4.0\nd    3.0\ndtype: float64"
     },
     "execution_count": 209,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obj1.index = ['c','d']\n",
    "obj1"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DataFrame\n",
    "   \n",
    "   表格型的数据结构。\n",
    "   \n",
    "   每列可以是不同值类型的数据\n",
    "\n",
    "   DF既有行索引，也有列索引，这个和excel差不多了\n",
    "\n",
    "   当然，也可以用DF来存储高维的数据类型"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "使用 del 关键字可以删除DF的列"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 210,
   "outputs": [
    {
     "data": {
      "text/plain": "   a   b   c\nw  0   1   2\nx  3   4   5\ny  6   7   8\nz  9  10  11",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>0</td>\n      <td>1</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3</td>\n      <td>4</td>\n      <td>5</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6</td>\n      <td>7</td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9</td>\n      <td>10</td>\n      <td>11</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 210,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame = pd.DataFrame(np.arange(12).reshape((4,3)),\n",
    "                     columns=list('abc'),\n",
    "                     index=list('wxyz'),\n",
    "                     dtype=int)\n",
    "frame"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "##### DF查询数据\n",
    "\n",
    "df.loc() 根据行列的标签值来进行查询\n",
    "\n",
    "df.iloc() 根据行列的索引（整数）进行查询\n",
    "\n",
    "df.where()\n",
    "\n",
    "df.query()\n",
    "\n",
    "df[columnname]取列 `columnname`可以是一个列表\n",
    "\n",
    "以上都是可以直接使用切片操作的"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 211,
   "outputs": [
    {
     "data": {
      "text/plain": "a    0\nb    1\nc    2\nName: w, dtype: int64"
     },
     "execution_count": 211,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 单行\n",
    "series = frame.loc['w']\n",
    "series"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 212,
   "outputs": [
    {
     "data": {
      "text/plain": "   w  y\na  0  6\nb  1  7\nc  2  8",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>w</th>\n      <th>y</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>a</th>\n      <td>0</td>\n      <td>6</td>\n    </tr>\n    <tr>\n      <th>b</th>\n      <td>1</td>\n      <td>7</td>\n    </tr>\n    <tr>\n      <th>c</th>\n      <td>2</td>\n      <td>8</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 212,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 多行，使用列表进行缩影\n",
    "s1 = frame.loc[['w','y']]\n",
    "s1.T"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 213,
   "outputs": [
    {
     "data": {
      "text/plain": "   a  b  c\nw  0  1  2\nx  3  4  5\ny  6  7  8",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>0</td>\n      <td>1</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3</td>\n      <td>4</td>\n      <td>5</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6</td>\n      <td>7</td>\n      <td>8</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 213,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 切片截取部分\n",
    "frame.loc['w':'y','a':'c']"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 214,
   "outputs": [
    {
     "data": {
      "text/plain": "   a   b   c\ny  6   7   8\nz  9  10  11",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>y</th>\n      <td>6</td>\n      <td>7</td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9</td>\n      <td>10</td>\n      <td>11</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 214,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 根据索引整数进行查询\n",
    "frame.iloc[2:]"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "##### DF的运算"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 215,
   "outputs": [
    {
     "data": {
      "text/plain": "   a  b  c\nw  0  0  0\nx  3  3  3\ny  6  6  6\nz  9  9  9",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3</td>\n      <td>3</td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6</td>\n      <td>6</td>\n      <td>6</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9</td>\n      <td>9</td>\n      <td>9</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 215,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame - series"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 216,
   "outputs": [
    {
     "data": {
      "text/plain": "   a   b   c\nw  0   1   2\nx  3   4   5\ny  6   7   8\nz  9  10  11",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>0</td>\n      <td>1</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3</td>\n      <td>4</td>\n      <td>5</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6</td>\n      <td>7</td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9</td>\n      <td>10</td>\n      <td>11</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 216,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame = pd.DataFrame(np.arange(12).reshape((4,3)),\n",
    "                     columns=list('abc'),\n",
    "                     index=list('wxyz'),\n",
    "                     dtype=int)\n",
    "frame"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "DF创建新的列也很简单"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 217,
   "outputs": [
    {
     "data": {
      "text/plain": "   a   b   c  a+b\nw  0   1   2    1\nx  3   4   5    7\ny  6   7   8   13\nz  9  10  11   19",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n      <th>a+b</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>0</td>\n      <td>1</td>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3</td>\n      <td>4</td>\n      <td>5</td>\n      <td>7</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6</td>\n      <td>7</td>\n      <td>8</td>\n      <td>13</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9</td>\n      <td>10</td>\n      <td>11</td>\n      <td>19</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 217,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame.loc[:,'a+b']=frame.loc[:,'a']+frame.loc[:,'b']\n",
    "frame"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "##### DF的描述统计"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 218,
   "outputs": [
    {
     "data": {
      "text/plain": "   a   b   c\nw  0   1   2\nx  3   4   5\ny  6   7   8\nz  9  10  11",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>0</td>\n      <td>1</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3</td>\n      <td>4</td>\n      <td>5</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6</td>\n      <td>7</td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9</td>\n      <td>10</td>\n      <td>11</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 218,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame = pd.DataFrame(np.arange(12).reshape((4,3)),\n",
    "                     columns=list('abc'),\n",
    "                     index=list('wxyz'),\n",
    "                     dtype=int)\n",
    "frame"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 219,
   "outputs": [
    {
     "data": {
      "text/plain": "a    18\nb    22\nc    26\ndtype: int64"
     },
     "execution_count": 219,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame.sum() # 按列进行统计归总"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 220,
   "outputs": [
    {
     "data": {
      "text/plain": "w     3\nx    12\ny    21\nz    30\ndtype: int64"
     },
     "execution_count": 220,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame.sum(axis=1) # 按行进行归总"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "参数说明\n",
    "\n",
    "* axis轴向，0是行，1是列\n",
    "* skipna 排除NaN，默认True\n",
    "* level 层次化的一个参数，不管吧"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "idxmin,idxmax方法返回的是最小值最大值的索引\n",
    "\n",
    "argmin,argmin方法返回的是最小值最大值的索引位置（整数类型）"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 221,
   "outputs": [
    {
     "data": {
      "text/plain": "a    z\nb    z\nc    z\ndtype: object"
     },
     "execution_count": 221,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame.idxmax()"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 222,
   "outputs": [
    {
     "data": {
      "text/plain": "w    a\nx    a\ny    a\nz    a\ndtype: object"
     },
     "execution_count": 222,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame.idxmin(axis=1)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 223,
   "outputs": [
    {
     "data": {
      "text/plain": "              a          b          c\ncount  4.000000   4.000000   4.000000\nmean   4.500000   5.500000   6.500000\nstd    3.872983   3.872983   3.872983\nmin    0.000000   1.000000   2.000000\n25%    2.250000   3.250000   4.250000\n50%    4.500000   5.500000   6.500000\n75%    6.750000   7.750000   8.750000\nmax    9.000000  10.000000  11.000000",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>count</th>\n      <td>4.000000</td>\n      <td>4.000000</td>\n      <td>4.000000</td>\n    </tr>\n    <tr>\n      <th>mean</th>\n      <td>4.500000</td>\n      <td>5.500000</td>\n      <td>6.500000</td>\n    </tr>\n    <tr>\n      <th>std</th>\n      <td>3.872983</td>\n      <td>3.872983</td>\n      <td>3.872983</td>\n    </tr>\n    <tr>\n      <th>min</th>\n      <td>0.000000</td>\n      <td>1.000000</td>\n      <td>2.000000</td>\n    </tr>\n    <tr>\n      <th>25%</th>\n      <td>2.250000</td>\n      <td>3.250000</td>\n      <td>4.250000</td>\n    </tr>\n    <tr>\n      <th>50%</th>\n      <td>4.500000</td>\n      <td>5.500000</td>\n      <td>6.500000</td>\n    </tr>\n    <tr>\n      <th>75%</th>\n      <td>6.750000</td>\n      <td>7.750000</td>\n      <td>8.750000</td>\n    </tr>\n    <tr>\n      <th>max</th>\n      <td>9.000000</td>\n      <td>10.000000</td>\n      <td>11.000000</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 223,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame.describe()"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### pandas处理NaN\n",
    "\n",
    "NaN是np.nan,同时Python自己的None也被当作是NaN\n",
    "\n",
    "* dropna 是进行过滤\n",
    "* fillna 使用指定值或插值来进行填充\n",
    "* isnull 返回bool\n",
    "* notnull isnull的非"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "##### dropna 过滤"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 224,
   "outputs": [
    {
     "data": {
      "text/plain": "0    1.0\n1    NaN\n2    3.0\n3    NaN\ndtype: float64"
     },
     "execution_count": 224,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.Series([1,np.nan,3,None])\n",
    "data"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 234,
   "outputs": [
    {
     "data": {
      "text/plain": "0    False\n1     True\n2    False\n3     True\ndtype: bool"
     },
     "execution_count": 234,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.isnull()"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 225,
   "outputs": [
    {
     "data": {
      "text/plain": "0    1.0\n2    3.0\ndtype: float64"
     },
     "execution_count": 225,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.dropna()"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 226,
   "outputs": [
    {
     "data": {
      "text/plain": "0    1.0\n2    3.0\ndtype: float64"
     },
     "execution_count": 226,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[data.notnull()] # 通过bool索引的方法来过滤缺失值"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "DF对NaN的处理，则dropna会默认丢弃含有NaN的行"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 235,
   "outputs": [
    {
     "data": {
      "text/plain": "   a   b   c\nw  0   1   2\nx  3   4   5\ny  6   7   8\nz  9  10  11",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>0</td>\n      <td>1</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3</td>\n      <td>4</td>\n      <td>5</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6</td>\n      <td>7</td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9</td>\n      <td>10</td>\n      <td>11</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 235,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame = pd.DataFrame(np.arange(12).reshape((4,3)),\n",
    "                     columns=list('abc'),\n",
    "                     index=list('wxyz'),\n",
    "                     dtype=int)\n",
    "frame"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 236,
   "outputs": [
    {
     "data": {
      "text/plain": "     a   b     c\nw  NaN   1   NaN\nx  3.0   4   NaN\ny  6.0   7   8.0\nz  9.0  10  11.0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>NaN</td>\n      <td>1</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3.0</td>\n      <td>4</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6.0</td>\n      <td>7</td>\n      <td>8.0</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9.0</td>\n      <td>10</td>\n      <td>11.0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 236,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame.loc['w','a'] = np.nan\n",
    "frame.loc['w','c'] = np.nan\n",
    "frame.loc['x','c'] = np.nan\n",
    "frame"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 229,
   "outputs": [
    {
     "data": {
      "text/plain": "     a   b     c\ny  6.0   7   8.0\nz  9.0  10  11.0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>y</th>\n      <td>6.0</td>\n      <td>7</td>\n      <td>8.0</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9.0</td>\n      <td>10</td>\n      <td>11.0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 229,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cleared = frame.dropna()  # 此时是丢弃了所有含有NAN的行\n",
    "cleared"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 237,
   "outputs": [
    {
     "data": {
      "text/plain": "    b\nw   1\nx   4\ny   7\nz  10",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>b</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>4</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>7</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>10</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 237,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cleared = frame.dropna(axis=1)\n",
    "cleared"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 238,
   "outputs": [
    {
     "data": {
      "text/plain": "     a   b     c\nw  NaN   1   NaN\nx  3.0   4   NaN\ny  6.0   7   8.0\nz  9.0  10  11.0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>NaN</td>\n      <td>1</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3.0</td>\n      <td>4</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6.0</td>\n      <td>7</td>\n      <td>8.0</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9.0</td>\n      <td>10</td>\n      <td>11.0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 238,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cleared = frame.dropna(how='all') # 丢弃全部是NaN的行\n",
    "cleared['d'] = np.nan\n",
    "cleared.dropna(how='all',axis=1)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "##### fillna 填充"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 231,
   "outputs": [
    {
     "data": {
      "text/plain": "     a   b     c\nw  NaN   1   NaN\nx  3.0   4   NaN\ny  6.0   7   8.0\nz  9.0  10  11.0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>NaN</td>\n      <td>1</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3.0</td>\n      <td>4</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6.0</td>\n      <td>7</td>\n      <td>8.0</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9.0</td>\n      <td>10</td>\n      <td>11.0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 231,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 232,
   "outputs": [
    {
     "data": {
      "text/plain": "     a   b     c\nw  1.0   1   1.0\nx  3.0   4   1.0\ny  6.0   7   8.0\nz  9.0  10  11.0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>1.0</td>\n      <td>1</td>\n      <td>1.0</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3.0</td>\n      <td>4</td>\n      <td>1.0</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6.0</td>\n      <td>7</td>\n      <td>8.0</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9.0</td>\n      <td>10</td>\n      <td>11.0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 232,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "frame.fillna(1)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 233,
   "outputs": [
    {
     "data": {
      "text/plain": "      a   b     c\nw  99.0   1  -1.0\nx   3.0   4  -1.0\ny   6.0   7   8.0\nz   9.0  10  11.0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>a</th>\n      <th>b</th>\n      <th>c</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>w</th>\n      <td>99.0</td>\n      <td>1</td>\n      <td>-1.0</td>\n    </tr>\n    <tr>\n      <th>x</th>\n      <td>3.0</td>\n      <td>4</td>\n      <td>-1.0</td>\n    </tr>\n    <tr>\n      <th>y</th>\n      <td>6.0</td>\n      <td>7</td>\n      <td>8.0</td>\n    </tr>\n    <tr>\n      <th>z</th>\n      <td>9.0</td>\n      <td>10</td>\n      <td>11.0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 233,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 可以对指定行填充不同的值\n",
    "frame.fillna({'a':99,'c':-1})"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "fillna默认是返回一个新的对象，若要对现有对象进行修改，可以使用inplace=True这个参数"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "3c874ee3d65450ef899cacaa06bdca40f9e4450c5e631a94911db3215dbda5b8"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
